Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Chen Feng

SAVMap: Structure-Aided Visual Mapping of Large-Scale 2.5D Manhattan Wireframes from Panoramic Video

Jun 01, 2026

Howard Huang, Bharath Surianarayanan, Keifer Lee, Chenyu Wang, Chen Feng

Abstract:Precise 3D representations of industrial environments enable tasks such as robot localization and digital twin generation. We propose SAVMap, a method for generating a semantic wireframe map of warehouse shelf and light structures using only a panoramic video camera as the sensor input. Sequences of rectified images with shelf and ceiling-facing views are extracted from a panoramic video captured along the warehouse aisles. Using a semantic segmentation network front end, a set of sparse, semantic structure feature points (e.g., corners of shelf structures, centers of lights) are extracted from each image and tracked across the sequences. By accounting for real-world geometric relationships among the points such as Manhattan grids, a constrained structure-from-motion algorithm yields the 3D points that form a wireframe map. We demonstrate the scalability and accuracy of our proposal in a warehouse with 46 shelving rows, each with faces spanning 55\,m by 7\,m. From an hour of panoramic video content, we create wireframe maps for over 5000 shelf elements across the rows, achieving an aggregate mean absolute error of 4.8\,cm with respect to ground-truth.

* IEEE ICRA 2026

Via

Access Paper or Ask Questions

ViewSAM: Learning View-aware Cross-modal Semantics for Weakly Supervised Cross-view Referring Multi-Object Tracking

May 04, 2026

Jiawei Ge, Xintian Zhang, Jiuxin Cao, Bo Liu, Fabian Deuser, Chang Liu, Gong Wenkang, Siyou Li, Juexi Shao, Wenqing Wu(+2 more)

Abstract:Cross-view Referring Multi-Object Tracking (CRMOT) aims to track multiple objects specified by natural language across multiple camera views, with globally consistent identities. Despite recent progress, existing methods rely heavily on costly frame-level spatial annotations and cross-view identity supervision. To reduce such reliance, we explore CRMOT under weak supervision by leveraging the capabilities of foundation models. However, our empirical study shows that directly applying foundation models such as SAM2 and SAM3, even with task-specific modifications, fails to accurately understand referring expressions and maintain consistent identities across views. Yet, they remain effective at producing reliable object tracklets that can serve as pseudo supervision. We therefore repurpose foundation models as pseudo-label generators and propose a two-stage framework for weakly supervised CRMOT, using only object category labels as coarse-grained supervision. In the first stage, we design an Affinity-guided Cross-view Re-prompting strategy to refine and associate SAM3-generated tracklets across cameras, producing reliable cross-view pseudo labels for subsequent training. In the second stage, we introduce ViewSAM, a CRMOT model built upon SAM2 that explicitly models view-aware cross-modal semantics. By formulating view-induced variations as learnable conditions, ViewSAM bridges the gap between view-variant visual observations and view-invariant textual expressions, enabling robust cross-view referring tracking with only approximately 10% additional parameters. Extensive experiments demonstrate that ViewSAM achieves SOTA performance under weak supervision and remains competitive with fully supervised methods.

Via

Access Paper or Ask Questions

Scene Change Detection with Vision-Language Representation Learning

Apr 13, 2026

Diwei Sheng, Vijayraj Gohil, Satyam Gaba, Zihan Liu, Giles Hamilton-Fletcher, John-Ross Rizzo, Yongqing Liang, Chen Feng

Abstract:Scene change detection (SCD) is crucial for urban monitoring and navigation but remains challenging in real-world environments due to lighting variations, seasonal shifts, viewpoint differences, and complex urban layouts. Existing methods rely primarily on low-level visual features, limiting their ability to accurately identify changed objects amid the visual complexity of urban scenes. In this paper, we propose LangSCD, a vision-language framework for scene change detection that overcomes this single-modal limitation by incorporating semantic reasoning through language. Our approach introduces a modular language component that leverages vision-language models (VLMs) to generate textual descriptions of scene changes, which are fused with visual features through a cross-modal feature enhancer. We further introduce a geometric-semantic matching module that refines the predicted masks by enforcing semantic consistency and spatial completeness. Existing real-world scene change detection benchmarks provide only binary change annotations, which are insufficient for downstream applications requiring fine-grained understanding of scene dynamics. To address this limitation, we introduce NYC-CD, a large-scale dataset of 8,122 real-world image pairs collected in New York City with multiclass change annotations generated through a semi-automatic pipeline. Extensive experiments across multiple street-view benchmarks demonstrate that our language and matching modules consistently improve existing change-detection architectures, achieving state-of-the-art performance and highlighting the value of integrating linguistic reasoning with visual representations for robust scene change detection.

Via

Access Paper or Ask Questions

A Mamba-based Perceptual Loss Function for Learning-based UGC Transcoding

Mar 26, 2026

Zihao Qi, Chen Feng, Fan Zhang, Xiaozhong Xu, Shan Liu, David Bull

Abstract:In user-generated content (UGC) transcoding, source videos typically suffer various degradations due to prior compression, editing, or suboptimal capture conditions. Consequently, existing video compression paradigms that solely optimize for fidelity relative to the reference become suboptimal, as they force the codec to replicate the inherent artifacts of the non-pristine source. To address this, we propose a novel perceptually inspired loss function for learning-based UGC video transcoding that redefines the role of the reference video, shifting it from a ground-truth pixel anchor to an informative contextual guide. Specifically, we train a lightweight neural quality model based on a Selective Structured State-Space Model (Mamba) optimized using a weakly-supervised Siamese ranking strategy. The proposed model is then integrated into the rate-distortion optimization (RDO) process of two neural video codecs (DCVC and HiNeRV) as a loss function, aiming to generate reconstructed content with improved perceptual quality. Our experiments demonstrate that this framework achieves substantial coding gains over both autoencoder and implicit neural representation-based baselines, with 8.46% and 12.89% BD-rate savings, respectively.

* 7 pages, 6 figures

Via

Access Paper or Ask Questions

Deconstructing the Failure of Ideal Noise Correction: A Three-Pillar Diagnosis

Mar 13, 2026

Chen Feng, Zhuo Zhi, Zhao Huang, Jiawei Ge, Ling Xiao, Nicu Sebe, Georgios Tzimiropoulos, Ioannis Patras

Abstract:Statistically consistent methods based on the noise transition matrix ($T$) offer a theoretically grounded solution to Learning with Noisy Labels (LNL), with guarantees of convergence to the optimal clean-data classifier. In practice, however, these methods are often outperformed by empirical approaches such as sample selection, and this gap is usually attributed to the difficulty of accurately estimating $T$. The common assumption is that, given a perfect $T$, noise-correction methods would recover their theoretical advantage. In this work, we put this longstanding hypothesis to a decisive test. We conduct experiments under idealized conditions, providing correction methods with a perfect, oracle transition matrix. Even under these ideal conditions, we observe that these methods still suffer from performance collapse during training. This compellingly demonstrates that the failure is not fundamentally a $T$-estimation problem, but stems from a more deeply rooted flaw. To explain this behaviour, we provide a unified analysis that links three levels: macroscopic convergence states, microscopic optimisation dynamics, and information-theoretic limits on what can be learned from noisy labels. Together, these results give a formal account of why ideal noise correction fails and offer concrete guidance for designing more reliable methods for learning with noisy labels.

* Accepted to CVPR2026

Via

Access Paper or Ask Questions

Nano-EmoX: Unifying Multimodal Emotional Intelligence from Perception to Empathy

Mar 03, 2026

Jiahao Huang, Fengyan Lin, Xuechao Yang, Chen Feng, Kexin Zhu, Xu Yang, Zhide Chen

Abstract:The development of affective multimodal language models (MLMs) has long been constrained by a gap between low-level perception and high-level interaction, leading to fragmented affective capabilities and limited generalization. To bridge this gap, we propose a cognitively inspired three-level hierarchy that organizes affective tasks according to their cognitive depth-perception, understanding, and interaction-and provides a unified conceptual foundation for advancing affective modeling. Guided by this hierarchy, we introduce Nano-EmoX, a small-scale multitask MLM, and P2E (Perception-to-Empathy), a curriculum-based training framework. Nano-EmoX integrates a suite of omni-modal encoders, including an enhanced facial encoder and a fusion encoder, to capture key multimodal affective cues and improve cross-task transferability. The outputs are projected into a unified language space via heterogeneous adapters, empowering a lightweight language model to tackle diverse affective tasks. Concurrently, P2E progressively cultivates emotional intelligence by aligning rapid perception with chain-of-thought-driven empathy. To the best of our knowledge, Nano-EmoX is the first compact MLM (2.2B) to unify six core affective tasks across all three hierarchy levels, achieving state-of-the-art or highly competitive performance across multiple benchmarks, demonstrating excellent efficiency and generalization.

Via

Access Paper or Ask Questions

CRAG: Can 3D Generative Models Help 3D Assembly?

Feb 26, 2026

Zeyu Jiang, Sihang Li, Siqi Tan, Chenyang Xu, Juexiao Zhang, Julia Galway-Witham, Xue Wang, Scott A. Williams, Radu Iovita, Chen Feng(+1 more)

Abstract:Most existing 3D assembly methods treat the problem as pure pose estimation, rearranging observed parts via rigid transformations. In contrast, human assembly naturally couples structural reasoning with holistic shape inference. Inspired by this intuition, we reformulate 3D assembly as a joint problem of assembly and generation. We show that these two processes are mutually reinforcing: assembly provides part-level structural priors for generation, while generation injects holistic shape context that resolves ambiguities in assembly. Unlike prior methods that cannot synthesize missing geometry, we propose CRAG, which simultaneously generates plausible complete shapes and predicts poses for input parts. Extensive experiments demonstrate state-of-the-art performance across in-the-wild objects with diverse geometries, varying part counts, and missing pieces. Our code and models will be released.

* 10 pages, 7 figures

Via

Access Paper or Ask Questions

FC-Vision: Real-Time Visibility-Aware Replanning for Occlusion-Free Aerial Target Structure Scanning in Unknown Environments

Feb 14, 2026

Chen Feng, Yang Xu, Shaojie Shen

Abstract:Autonomous aerial scanning of target structures is crucial for practical applications, requiring online adaptation to unknown obstacles during flight. Existing methods largely emphasize collision avoidance and efficiency, but overlook occlusion-induced visibility degradation, severely compromising scanning quality. In this study, we propose FC-Vision, an on-the-fly visibility-aware replanning framework that proactively and safely prevents target occlusions while preserving the intended coverage and efficiency of the original plan. Our approach explicitly enforces dense surface-visibility constraints to regularize replanning behavior in real-time via an efficient two-level decomposition: occlusion-free viewpoint repair that maintains coverage with minimal deviation from the nominal scan intent, followed by segment-wise clean-sensing connection in 5-DoF space. A plug-in integration strategy is also presented to seamlessly interface FC-Vision with existing UAV scanning systems without architectural changes. Comprehensive simulation and real-world evaluations show that FC-Vision consistently improves scanning quality under unexpected occluders, delivering a maximum coverage gain of 55.32% and a 73.17% reduction in the occlusion ratio, while achieving real-time performance with a moderate increase in flight time. The source code will be made publicly available.

* 8 pages, 8 figures, 3 tables

Via

Access Paper or Ask Questions

Neurophysiological effects of museum modalities on emotional engagement with real artworks

Feb 02, 2026

Chen Feng, Sébastien Lugan, Karine Lasaracina, Midori Sugaya, Benoît Macq

Abstract:Museums increasingly rely on digital content to support visitors' understanding of artworks, yet little is known about how these formats shape the emotional engagement that underlies meaningful art experiences. This research presents an in-situ EEG study on how digital interpretive content modulate engagement during art viewing. Participants experienced three modalities: direct viewing of a Bruegel painting, a 180° immersive interpretive projection, and a regular, display-based interpretive video. Frontal EEG markers of motivational orientation, internal involvement, perceptual drive, and arousal were extracted using eyes-open baselines and Z-normalized contrasts. Results show modality-specific engagement profiles: display-based interpretive video induced high arousal and fast-band activity, immersive projections promoted calm, presence-oriented absorption, and original artworks reflected internally regulated engagement. These findings, relying on lightweight EEG sensing in an operational cultural environment, suggest that digital interpretive content affects engagement style rather than quantity. This paves the way for new multimodal sensing approaches and enables museums to optimize the modalities and content of their interpretive media.

* 7 pages, 4 figures - \c{opyright}IEEE EmotionSense 2026/PerCom 2026

Via

Access Paper or Ask Questions

Beyond Training for Cultural Awareness: The Role of Dataset Linguistic Structure in Large Language Models

Feb 01, 2026

Reem I. Masoud, Chen Feng, Shunta Asano, Saied Alshahrani, Philip Colin Treleaven, Miguel R. D. Rodrigues

Abstract:The global deployment of large language models (LLMs) has raised concerns about cultural misalignment, yet the linguistic properties of fine-tuning datasets used for cultural adaptation remain poorly understood. We adopt a dataset-centric view of cultural alignment and ask which linguistic properties of fine-tuning data are associated with cultural performance, whether these properties are predictive prior to training, and how these effects vary across models. We compute lightweight linguistic, semantic, and structural metrics for Arabic, Chinese, and Japanese datasets and apply principal component analysis separately within each language. This design ensures that the resulting components capture variation among datasets written in the same language rather than differences between languages. The resulting components correspond to broadly interpretable axes related to semantic coherence, surface-level lexical and syntactic diversity, and lexical or structural richness, though their composition varies across languages. We fine-tune three major LLM families (LLaMA, Mistral, DeepSeek) and evaluate them on benchmarks of cultural knowledge, values, and norms. While PCA components correlate with downstream performance, these associations are strongly model-dependent. Through controlled subset interventions, we show that lexical-oriented components (PC3) are the most robust, yielding more consistent performance across models and benchmarks, whereas emphasizing semantic or diversity extremes (PC1-PC2) is often neutral or harmful.

Via

Access Paper or Ask Questions